In the context of density level set estimation, we study the convergence of general plug-in
methods under two main assumptions on the density for a given level $\lambda$. More precisely, it
is assumed that the density (i) is smooth in a neighborhood of $\lambda$ and (ii) has $\gamma$-exponent at
level $\lambda$. Condition (i) ensures that the density can be estimated at a standard nonparametric
rate and condition (ii) is similar to Tsybakov's margin assumption, which is stated for the
classification framework. Under these assumptions, we derive optimal rates of convergence for
plug-in estimators. Explicit convergence rates are given for plug-in estimators based on kernel
density estimators when the underlying measure is the Lebesgue measure. Lower bounds proving
optimality of the rates in a minimax sense when the density is Hölder smooth are also provided.

Keywords: density level sets; kernel density estimators; minimax lower bounds; plug-in 
1. Introduction

Let $Q$ be a positive $\sigma$-finite measure on $\mathcal{X} \subseteq \mathbb{R}^d$. Consider i.i.d. random vectors
$(X_1, \dots, X_n)$ with distribution $P$, having an unknown probability density $p$ with
respect to the measure $Q$. For a fixed $\lambda > 0$, we are interested in the estimation of the
$\lambda$-level set of the density $p$ defined by

$$\Gamma_p(\lambda) = \{x \in \mathcal{X} : p(x) > \lambda\}. \tag{1.1}$$

Throughout the paper, we fix $\lambda > 0$ and when no confusion is possible, we use the notation
$\Gamma(\lambda)$, or simply $\Gamma$, instead of $\Gamma_p(\lambda)$. When $Q$ is the Lebesgue measure on $\mathbb{R}^d$, density
level sets typically correspond to minimum volume sets of given $P$-probability mass, as
shown in Polonik (1997).

Remark 1.1. A somewhat preponderant definition of a density level set is

$$\bar\Gamma(\lambda) = \{x \in \mathcal{X} : p(x) \ge \lambda\}, \tag{1.2}$$

that is, the union of $\Gamma(\lambda)$ and the set $\{x \in \mathcal{X} : p(x) = \lambda\}$. Since, in this paper, the
density is allowed to have flat parts at level $\lambda$, the sets $\Gamma(\lambda)$ and $\bar\Gamma(\lambda)$ can differ by
an arbitrarily large set. Density level sets defined by (1.1) or (1.2) can be estimated
using plug-in estimators with positive or negative offset, respectively (see Section 2.2).
However, definition (1.1) remains consistent with the definition of the support of the
density when $\lambda = 0$. The results detailed hereafter pertain only to this definition, but are
applicable to definition (1.2) after minor changes.

The following are two possible applications of density level set estimation. 

Anomaly detection. The goal is to detect an abnormal observation from a sample (see,
e.g., Steinwart et al. (2005) and references therein). One way to deal with that problem
is to assume that abnormal observations do not belong to a group of concentrated
observations. In this framework, observations are considered to be abnormal when they
do not belong to $\Gamma(\lambda)$ for some fixed $\lambda \ge 0$. The special case $\lambda = 0$, which corresponds
to support estimation, has been examined by Devroye and Wise (1980), for example.
In the general case, $\lambda$ can be considered a tolerance level for anomalies: the smaller $\lambda$,
the fewer observations are considered to be abnormal.

Unsupervised or semi-supervised classification. These two problems amount to
the identification of areas where the observations are concentrated, with possible use of
some available labels for the semi-supervised case. For instance, it can be assumed that
the connected components of $\Gamma(\lambda)$, for a fixed $\lambda$, are clusters of homogeneous observations,
as described in Hartigan (1975). Note that this definition has been refined, for
example in Stuetzle (2003), and has been studied with plug-in estimators in Rigollet
(2007).

Remark 1.2. In both applications, the choice of $\lambda$ is critical and must be addressed
carefully. However, this problem is beyond the scope of this paper.

There are essentially two approaches toward estimating density level sets from the
sample $(X_1, \dots, X_n)$. The most straightforward is to resort to plug-in methods where
the density $p$ in the expression for $\Gamma_p(\lambda)$ is replaced by its estimate computed from the
sample. Another way to estimate density level sets is to resort to direct methods which
are based on empirical excess-mass maximization. The excess-mass $H$ is a functional
that measures the quality of an estimator $G$ and is defined as follows (Hartigan (1987);
Müller and Sawitzki (1987)):

$$H(G) = P(G) - \lambda Q(G).$$

Excess-mass measures how the $P$-probability mass concentrates in the region $G$ and
it is maximized by $\Gamma = \Gamma(\lambda)$. Hence, it acts as a risk functional in the density level
set estimation (DLSE) framework and it is natural to measure the performance of an
estimator $G$ by its excess-mass deficit $H(\Gamma) - H(G) \ge 0$. Further justifications for the
well-foundedness of the excess-mass criterion can be found in Polonik (1995). Recently,
Gayraud and Rousseau (2005) proposed a Bayesian approach to DLSE, together with
interesting comparative simulations.
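
As a small numerical illustration of the excess-mass criterion, the following sketch evaluates $H(G) = P(G) - \lambda Q(G)$ for a few candidate intervals. It rests on assumptions that are not taken from the paper: $p$ is the standard normal density, $Q$ is the Lebesgue measure on the real line, and the level `LAM = 0.2` is arbitrary. Under these assumptions the level set is an interval $(-T, T)$, and every other candidate should show a non-negative excess-mass deficit.

```python
import math

LAM = 0.2  # illustrative level (an assumption, not a value from the paper)

def density(x):
    """Standard normal density, used here as a stand-in for p."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def excess_mass(a, b, steps=20000):
    """H([a, b]) = P([a, b]) - LAM * Leb([a, b]), via the midpoint rule."""
    width = (b - a) / steps
    mass = sum(density(a + (i + 0.5) * width) for i in range(steps)) * width
    return mass - LAM * (b - a)

# For this density, {p > LAM} is the interval (-T, T) with density(T) = LAM.
T = math.sqrt(-2.0 * math.log(LAM * math.sqrt(2.0 * math.pi)))
H_OPT = excess_mass(-T, T)

candidates = [(-1.0, 1.0), (-2.0, 2.0), (0.0, T), (-T, T)]
deficits = [H_OPT - excess_mass(a, b) for a, b in candidates]
for (a, b), deficit in zip(candidates, deficits):
    print(f"G = [{a:.3f}, {b:.3f}]  excess-mass deficit = {deficit:.5f}")
```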

While local versions of direct methods have been analyzed deeply and proven to be
optimal in a minimax sense over a certain family of well-behaved distributions (see Tsybakov
(1997)), and although reasonable implementations have been recently proposed (see,
e.g., Steinwart et al. (2005)), they are still not very easy to use for practical purposes,
compared to plug-in methods. Indeed, in practice, rather than specifying a value for
$\lambda$, the user can specify a value for $\alpha$, the $P$-probability mass of the level set. In this
case, the value of $\lambda$ is implied by that of $\alpha$ and efficient direct methods can be derived
(Scott and Nowak (2006)). However, in general, using direct methods, one must run an
optimization procedure several times, for different density level values, then choose a
posteriori the most suitable level according to the desired rejection rate. Plug-in methods do
not involve such a complex process: the density estimation step is only performed once
and the construction of a density level set estimate simply amounts to thresholding the
density estimate at the desired level.

On the other hand, in the related context of binary classification, where more
theoretical advances have been developed, the different analyses proposed thus far have mainly
supported a belief in the superiority of direct methods. Yang (1999) shows that, under
general assumptions, plug-in estimators cannot achieve a classification error risk
convergence rate faster than $O(1/\sqrt{n})$, where $n$ is the size of the data sample, and suffer from
the curse of dimensionality. In contrast, under slightly different assumptions,
direct methods achieve this rate $O(1/\sqrt{n})$, whatever the dimensionality (see, e.g., Vapnik
(1998); Devroye et al. (1996); Tsybakov (2004)), and can even reach faster convergence
rates, up to $O(1/n)$, under Tsybakov's margin assumption (see Mammen and Tsybakov
(1999); Tsybakov (2004); Tsybakov and van de Geer (2005); Tarigan and van de Geer
(2006)). This has contributed to some pessimism concerning plug-in methods.
Nevertheless, such a comparison between plug-in methods and direct methods is far
from legitimate, since the aforementioned analyses of both plug-in methods and direct
ones have been carried out under different sets of assumptions (those sets are not disjoint,
but neither of them is included in the other).

Recently, in the standard classification framework, Audibert and Tsybakov (2007) have
combined a new type of assumption dealing with the smoothness of the regression
function and the well-known margin assumption. Under these assumptions, they derive fast
convergence rates, even faster than $O(1/n)$ in some situations, for plug-in classification
rules based on local polynomial estimators. This new result reveals that plug-in methods
should not be considered inferior to direct methods and, more importantly, that this new
type of assumption on the regression function is a critical point in the general analysis
of classification procedures.

In this paper, we extend such positive results to the DLSE framework: we revisit the
analysis of plug-in density level set estimators and show that they can also be very efficient
under smoothness assumptions on the underlying density function $p$. Unlike the global
smoothness assumption used in Audibert and Tsybakov (2007), the local smoothness
assumption introduced here emphasizes the predominant role of the smoothness close
to the level $\lambda$, as opposed to the smoothness for values of $p$ far from the level under
consideration. Related papers are Baíllo et al. (2001) and Baíllo (2003), which investigate
plug-in estimators based on a certain type of kernel density estimate. Baíllo et al. (2001)
also study the convergence for the symmetric difference under other assumptions and
Baíllo (2003) derives almost sure rates of convergence for a quantity different from the
ones studied here. It is interesting to observe that she introduces a condition similar to
the $\gamma$-exponent used here.

The particular case $\lambda = 0$ corresponds to estimation of the support of density $p$ and is
often applied to anomaly detection. Following the pioneering paper of Devroye and Wise
(1980), this problem has received more attention than the general case $\lambda > 0$ and has
been treated using plug-in methods, for example by Cuevas and Fraiman (1997). Unlike
the previously cited papers, we derive rates of convergence and prove that these rates
are optimal in a minimax sense. However, we do not treat the case $\lambda = 0$, for which
the rates are typically different from those for $\lambda > 0$, as pointed out by Tsybakov (1997), for
example. The techniques employed in the present analysis can be refined to encompass
the case $\lambda = 0$ and the results will be published separately.

A general plug-in approach has been studied previously by Molchanov (1998), where
a result on the asymptotic distribution of the Hausdorff distance is given. In a recent
paper, Cuevas et al. (2006) study general plug-in estimators of the level sets. Under very
general assumptions, they derive consistency with respect to the Hausdorff metric and
the measure of the symmetric difference. However, this very general framework does not
allow them to derive rates of convergence.

This paper is organized as follows. Section 2 introduces the notation and definitions.
Section 3 presents the main result, that is, a new bound on the error of plug-in estimators
based on general density estimators that satisfy a certain exponential inequality. In
Section 4, we then apply this result to the particular case of kernel density estimators, under
the assumption that the underlying density belongs to some locally Hölder smooth class
of densities. Finally, minimax lower bounds are given in Section 5, as a way to assess the
optimality of the upper bounds involved in the main result.

2. Notation and setup 

For any vector $x \in \mathbb{R}^d$, denote by $x^{(j)}$ its $j$th coordinate, $j = 1, \dots, d$. Denote by $\|\cdot\|$ the
Euclidean norm in $\mathbb{R}^d$ and by $B(x, r)$ the closed Euclidean ball in $\mathcal{X}$ centered at $x \in \mathcal{X}$
and of radius $r > 0$.

The probability and expectation with respect to the joint distribution of $(X_1, \dots, X_n)$
are denoted by $\mathbb{P}$ and $\mathbb{E}$, respectively. For any function $f : \mathbb{R}^d \to \mathbb{R}$, we denote by
$\|f\|_\infty = \sup_{x \in \mathbb{R}^d} |f(x)|$ the sup-norm of $f$ and by $\|f\| = (\int_{\mathbb{R}^d} f^2(x)\,dx)^{1/2}$ its $L_2$-norm. Also, for
any measurable function $f$ on $\mathcal{X}$ and any set $A \subset f(\mathcal{X})$, we write, for simplicity,
$\{f \in A\} = \{x \in \mathcal{X} : f(x) \in A\}$. Throughout the paper, we denote by $C$ positive constants that
can change from line to line and by $c_j$ positive constants that have to be identified.
Finally, $A^c$ denotes the complement of the set $A$.

Our choice of measure of performance will affect the construction of our estimator, so 
we begin by discussing this topic. 
2.1. Measures of performance

Recall that $Q$ is a positive $\sigma$-finite measure on $\mathcal{X}$ and define the measure $Q_\lambda$ that has
density $|p(\cdot) - \lambda|$ with respect to $Q$. To assess the performance of a density level set
estimator, we use the two pseudo-distances between two sets $G_1$ and $G_2 \subset \mathcal{X}$:

(i) the $Q$-measure of the symmetric difference between $G_1$ and $G_2$,

$$d_\Delta(G_1, G_2) = Q(G_1 \Delta G_2);$$

(ii) the $Q_\lambda$-measure of the symmetric difference between $G_1$ and $G_2$,

$$d_H(G_1, G_2) = Q_\lambda(G_1 \Delta G_2) = \int_{G_1 \Delta G_2} |p(x) - \lambda| \, dQ(x).$$

The quantity $d_\Delta(G_1, G_2)$ is a standard and natural way to measure the distance between
two sets $G_1$ and $G_2$. Note that for any measurable set $G \subset \mathcal{X}$, the excess-mass $H(G)$ can
be written

$$H(G) = \int_G (p(x) - \lambda) \, dQ(x).$$

Thus, we can rewrite

$$H(\Gamma) - H(G) = \int_{\mathcal{X}} \big(1_{\{p(\cdot) > \lambda\}}(x) - 1_G(x)\big)(p(x) - \lambda) \, dQ(x) = \int_{\Gamma \Delta G} |p(x) - \lambda| \, dQ(x) = d_H(G, \Gamma).$$

This explains the notation $d_H$.
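
The identity between the excess-mass deficit and $d_H$ can be checked numerically. The sketch below does so under illustrative assumptions that do not come from the paper: $p$ is the standard normal density, $Q$ is the Lebesgue measure on the real line, the level is `LAM = 0.2`, and the candidate set is the interval $(-0.5, 2)$. The helper names are hypothetical.

```python
import math

LAM = 0.2  # illustrative level (assumption)

def density(x):
    """Standard normal density, standing in for p."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# Level set {p > LAM} = (-T, T) for this particular density.
T = math.sqrt(-2.0 * math.log(LAM * math.sqrt(2.0 * math.pi)))

def integrate(f, a, b, steps):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

def excess_mass(a, b):
    """H((a, b)) = integral over (a, b) of (p - LAM)."""
    return integrate(lambda x: density(x) - LAM, a, b, 20000)

def d_h(a, b, lo=-6.0, hi=6.0):
    """Integral of |p - LAM| over the symmetric difference of (a, b) and (-T, T)."""
    def integrand(x):
        in_g = a < x < b
        in_gamma = -T < x < T
        return abs(density(x) - LAM) if in_g != in_gamma else 0.0
    return integrate(integrand, lo, hi, 200000)

A, B = -0.5, 2.0
deficit = excess_mass(-T, T) - excess_mass(A, B)
distance = d_h(A, B)
print(f"H(Gamma) - H(G) = {deficit:.6f},  d_H(G, Gamma) = {distance:.6f}")
```

The two printed numbers should agree up to the discretization error of the grid.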

The following definition introduces a quantity which critically controls the complexity
of the problem and therefore the attainable rates of convergence. In particular, it allows
us to link $d_H$ to $d_\Delta$.

Definition 2.1. For any $\lambda, \gamma \ge 0$, a function $f : \mathcal{X} \to \mathbb{R}$ is said to have $\gamma$-exponent at
level $\lambda$ with respect to $Q$ if there exist constants $c_0 > 0$ and $\varepsilon_0 > 0$ such that, for all
$0 < \varepsilon \le \varepsilon_0$,

$$Q\{x \in \mathcal{X} : 0 < |f(x) - \lambda| \le \varepsilon\} \le c_0 \varepsilon^\gamma.$$

The assumption under which the underlying density has $\gamma$-exponent at level $\lambda$ was
first introduced by Polonik (1995). Its counterpart in the context of binary classification
is commonly referred to as the margin assumption (see Mammen and Tsybakov (1999);
Tsybakov (2004)).

The exponent $\gamma$ controls the slope of the function around level $\lambda$. When $\gamma = 0$, the
condition holds trivially and when $\gamma$ is positive, it constrains the rate at which the function
approaches the level $\lambda$. A standard case corresponds to $\gamma = 1$, arising, for instance, in the
case where the gradient of $f$ has a coordinate bounded away from $0$ in a neighborhood
of $\{f = \lambda\}$.
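
To make the standard case $\gamma = 1$ concrete, the following sketch computes the Lebesgue measure of $\{x : 0 < |p(x) - \lambda| \le \varepsilon\}$ in closed form for one example and checks that it scales linearly in $\varepsilon$. The example is an assumption made for illustration: $p$ is the standard normal density on the real line and `LAM = 0.2`. Since $|p'|$ is bounded away from $0$ near $\{p = \lambda\}$ here, the ratio of the measure to $\varepsilon$ should stay bounded and approach $4 / |p'(t)|$, where $t$ is the positive solution of $p(t) = \lambda$.

```python
import math

LAM = 0.2  # illustrative level (assumption)
SQRT_2PI = math.sqrt(2.0 * math.pi)

def half_width(level):
    """Half-width t of {x : p(x) > level} when p is the standard normal density."""
    return math.sqrt(-2.0 * math.log(level * SQRT_2PI))

def band_measure(eps):
    """Lebesgue measure of {x : 0 < |p(x) - LAM| <= eps}: two symmetric shells."""
    return 2.0 * (half_width(LAM - eps) - half_width(LAM + eps))

# Ratio measure / eps for decreasing eps; a bounded ratio indicates gamma = 1.
ratios = {eps: band_measure(eps) / eps for eps in (0.05, 0.01, 0.001)}

# Limiting ratio 4 / |p'(t)|, using |p'(t)| = t * p(t) = t * LAM.
limit = 4.0 / (LAM * half_width(LAM))

for eps, ratio in ratios.items():
    print(f"eps = {eps:<6} measure / eps = {ratio:.4f}   (limit {limit:.4f})")
```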

We now show that the pseudo-distances $d_\Delta$ and $d_H$ are linked when the density $p$ has
$\gamma$-exponent at level $\lambda$. The following proposition is a direct consequence of Proposition A.1.

Proposition 2.1. Fix $\lambda > 0$ and $\gamma > 0$. If the density $p$ has $\gamma$-exponent at level $\lambda$
w.r.t. $Q$, then, for any $L_Q > 0$, there exists $C > 0$ such that for any $G_1, G_2$ satisfying
$Q(G_1 \Delta G_2) \le L_Q$, we have

$$d_\Delta(G_1, G_2) \le Q(G_1 \Delta G_2 \cap \{p = \lambda\}) + C \big(d_H(G_1, G_2)\big)^{\gamma/(1+\gamma)}.$$

Note that for any density level set estimator $\hat G$, it holds that $d_H(\hat G, \Gamma(\lambda)) =
d_H(\hat G, \bar\Gamma(\lambda))$. In other words, the choice of definition of the density level set will not
affect the performance of an estimator when measured by its excess-mass deficit.
However, the distance $d_\Delta$ is very sensitive to this choice, as illustrated in Section 2.2, and
one must resort to offsets to control the first term on the right-hand side of the result in
Proposition 2.1.

2.2. Plug-in density level set estimators with offset 

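For a fixed $\lambda > 0$, the plug-in estimator of $\Gamma(\lambda)$ is defined by

$$\hat\Gamma(\lambda) = \{x \in \mathcal{X} : \hat p_n(x) > \lambda\},$$

where $\hat p_n$ is a nonparametric estimator of $p$. For example, $\hat p_n$ can be a kernel density
estimator of $p$,

$$\hat p_n(x) = \frac{1}{n h^d} \sum_{i=1}^n K\Big(\frac{X_i - x}{h}\Big),$$

where $K : \mathbb{R}^d \to \mathbb{R}$ is a suitably chosen kernel and $h > 0$ is the bandwidth parameter. For
reasons that will be made clear later, we consider the family of plug-in estimators with
offset $\ell_n$, denoted by $\hat\Gamma_{\ell_n}$ and defined as

$$\hat\Gamma_{\ell_n} = \hat\Gamma_{\ell_n}(\lambda) = \hat\Gamma(\lambda + \ell_n) = \{x \in \mathcal{X} : \hat p_n(x) > \lambda + \ell_n\},$$

where $\ell_n$ is a quantity that typically tends to $0$ as $n$ tends to infinity.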
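
The construction above can be sketched in a few lines of code. The following is a minimal illustration in dimension $d = 1$, not the authors' implementation: the Gaussian kernel, the simulated standard normal sample, the bandwidth, the level and the offset are all assumptions chosen for the example, and the level set is only evaluated on a finite grid.

```python
import math
import random

def gaussian_kernel(u):
    """Gaussian kernel (an illustrative choice of K)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h):
    """Kernel density estimate (1 / (n h)) * sum_i K((X_i - x) / h) for d = 1."""
    return sum(gaussian_kernel((xi - x) / h) for xi in sample) / (len(sample) * h)

def plug_in_level_set(grid, sample, h, lam, offset=0.0):
    """Grid points x with kde(x) > lam + offset: a discretized plug-in estimator."""
    return [x for x in grid if kde(x, sample, h) > lam + offset]

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(500)]   # simulated data
grid = [-4.0 + 0.05 * i for i in range(161)]            # evaluation grid on [-4, 4]
lam, h = 0.2, 0.4                                       # illustrative level and bandwidth

no_offset = plug_in_level_set(grid, sample, h, lam)
with_offset = plug_in_level_set(grid, sample, h, lam, offset=0.05)
print(f"{len(no_offset)} grid points without offset, {len(with_offset)} with offset 0.05")
```

A positive offset can only shrink the estimated set, since the threshold is raised while the density estimate is unchanged.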

As mentioned in Remark 1.1, when the goal is to estimate the set $\Gamma(\lambda)$, the offset
$\ell_n$ is chosen to be positive, whereas for $\bar\Gamma(\lambda)$, it must be chosen negative. The effect of
such choices is to ensure that the set $\{p = \lambda\}$ is, respectively, removed or added to the
standard plug-in estimator with high probability. This phenomenon emerges only when
the performance is measured using pseudo-distance $d_\Delta$, but not $d_H$. This translates into
different optimal choices for the offset, as well as different optimal rates of convergence
depending on the chosen measure of performance.
The following counterexample suggested by an anonymous referee demonstrates that
standard plug-in estimators can fail to consistently estimate the set $\{p = \lambda\}$. Assume that
$\mathcal{X} \subset \mathbb{R}$, that the density $p$ is such that $p(x) = 1/2$ for all $x \in [0, 1]$ and that $p(x) < 1/2$
elsewhere. In this case, it is clear that $\Gamma(1/2) = \emptyset$ and $\bar\Gamma(1/2) = [0, 1]$. Assume, now, that
we have an estimator $\hat p$ such that $|\hat p(x) - p(x)| \le \varepsilon$, where $\varepsilon > 0$ is arbitrarily small. If
$\hat p(x) = 1/2 + \varepsilon$ for any $x \in [0, 1]$, then $\hat\Gamma(1/2) \supset [0, 1]$ and it fails to consistently estimate
$\Gamma(1/2)$ as $\varepsilon$ tends to $0$. However, $\hat\Gamma_{\ell_n}$ with a positive offset $\ell_n > \varepsilon$ can become consistent,
as shown in Section 3. Conversely, if $\hat p(x) = 1/2 - \varepsilon$ for any $x \in [0, 1]$, then $\hat\Gamma(1/2)$ is not
a consistent estimator of $\bar\Gamma(1/2)$, but $\hat\Gamma_{\ell_n}$ with a negative offset $\ell_n < -\varepsilon$ can be one.

As a consequence, plug-in density level set estimators can match both definitions of
density level sets (1.1) or (1.2) by simply changing the sign of the offset.
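
The counterexample can be reproduced numerically. In the sketch below, the shape of $p$ outside $[0, 1]$ (equal to $1/4$ on $(1, 3]$), the value of `EPS` and the grid are assumptions made only so that the example is concrete. The estimators simply shift $p$ by plus or minus `EPS`, as in the text.

```python
EPS = 0.01  # uniform estimation error (illustrative value)

def p(x):
    """Density equal to 1/2 on [0, 1] and 1/4 on (1, 3]; it integrates to 1."""
    if 0.0 <= x <= 1.0:
        return 0.5
    if 1.0 < x <= 3.0:
        return 0.25
    return 0.0

def over(x):
    """Estimator that overshoots p by EPS everywhere."""
    return p(x) + EPS

def under(x):
    """Estimator that undershoots p by EPS everywhere."""
    return p(x) - EPS

def measure(estimate, lam, offset, lo=-1.0, hi=4.0, steps=5000):
    """Approximate Lebesgue measure of {x : estimate(x) > lam + offset} on a grid."""
    width = (hi - lo) / steps
    return sum(width for i in range(steps)
               if estimate(lo + (i + 0.5) * width) > lam + offset)

# Target Gamma(1/2) is empty: the plain plug-in set wrongly contains [0, 1],
# while a positive offset larger than EPS removes it.
m_over_plain = measure(over, 0.5, 0.0)
m_over_offset = measure(over, 0.5, 2 * EPS)

# Target bar-Gamma(1/2) = [0, 1]: the plain plug-in set is empty,
# while a negative offset smaller than -EPS recovers [0, 1].
m_under_plain = measure(under, 0.5, 0.0)
m_under_offset = measure(under, 0.5, -2 * EPS)

print(m_over_plain, m_over_offset, m_under_plain, m_under_offset)
```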

3. Fast rates for plug-in density level set estimators 
with offset 

The first theorem states that rates of convergence for plug-in estimators with offset can
be obtained using exponential inequalities for the corresponding nonparametric density
estimator $\hat p_n$. In what follows, smoothness in the neighborhood of the level under
consideration is particularly important and we define this neighborhood as follows:

$$\mathcal{D}(\eta) = \{p \in (\lambda - \eta, \lambda + \eta)\}, \qquad \eta > 0.$$

In the sequel, we write, for simplicity, $\hat\Gamma_{\ell_n} = \hat\Gamma$ when the value of the offset is clear from
the context.

The following definition will be at the center of our main theorem. It provides a compact
way to describe the pointwise convergence of an estimator to the true density $p$.

Definition 3.1. Let $\mathcal{P}$ be a given class of probability densities on $\mathcal{X}$ and fix $\Delta > 0$. Let
$\varphi = (\varphi_n)$ and $\psi = (\psi_n)$ be two positive, monotonically non-increasing sequences.

We say that an estimator $\hat p_n$ is pointwise convergent at a rate $(\psi_n)$ uniformly over $\mathcal{P}$
if there exist positive constants $c_1, c_2, c_\psi$ such that for $Q$-almost all $x \in \mathcal{X}$, we have

$$\sup_{p \in \mathcal{P}} \mathbb{P}\big(|\hat p_n(x) - p(x)| \ge \delta\big) \le c_1 e^{-c_2 (\delta/\psi_n)^2}, \qquad c_\psi \psi_n < \delta < \Delta. \tag{3.1}$$

Moreover, we say that an estimator $\hat p_n$ is $(\varphi, \psi)$-locally pointwise convergent in $\mathcal{D}(\eta)$,
uniformly over $\mathcal{P}$, if it is pointwise convergent at a rate $(\psi_n)$ uniformly over $\mathcal{P}$ and there
exist positive constants $c_3, c_4, c_\varphi$ such that for $Q$-almost all $x \in \mathcal{D}(\eta)$, we have

$$\sup_{p \in \mathcal{P}} \mathbb{P}\big(|\hat p_n(x) - p(x)| \ge \delta\big) \le c_3 e^{-c_4 (\delta/\varphi_n)^2}, \qquad c_\varphi \varphi_n < \delta < \Delta. \tag{3.2}$$

The following theorem states that it is possible to derive fast rates of convergence under
the $\gamma$-exponent for plug-in estimators constructed from a locally pointwise convergent
estimator of the density.
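Theorem 3.1. Fix $\lambda > 0$, $\Delta > 0$ and let $\mathcal{P}$ be a class of densities on $\mathcal{X}$. Let $\varphi = (\varphi_n)$
and $\psi = (\psi_n)$ be two positive, monotonically non-increasing sequences such that

$$\varphi_n \vee \psi_n = o\Big(\frac{1}{\sqrt{\log n}}\Big) \quad \text{and} \quad \varphi_n \ge C n^{-\mu} \text{ for some } C, \mu > 0.$$

Let $\hat p_n$ be an estimator of the density $p$ constructed from data arising from $p \in \mathcal{P}$ and
such that $Q(\hat p_n > \lambda) \le M$, almost surely for some positive constant $M$. Assume that $\hat p_n$ is
$(\varphi, \psi)$-locally pointwise convergent in $\mathcal{D}(\eta)$, uniformly over $\mathcal{P}$. Then, if $p$ has $\gamma$-exponent
at level $\lambda$ for any $p \in \mathcal{P}$, the plug-in estimator $\hat\Gamma$ based on $\hat p_n$, with offset $\ell_n$, satisfies

$$\sup_{p \in \mathcal{P}} \mathbb{E}\big[d_H(\Gamma_p(\lambda), \hat\Gamma)\big] \le C \varphi_n^{1+\gamma} \quad \text{for } \ell_n \le C \varphi_n, \tag{3.3}$$

$$\sup_{p \in \mathcal{P}} \mathbb{E}\big[d_\Delta(\Gamma_p(\lambda), \hat\Gamma)\big] \le C \big(\varphi_n \sqrt{\log n}\big)^{\gamma} \quad \text{for } \ell_n = c_\ell \varphi_n \sqrt{\log n}, \tag{3.4}$$

for $n \ge n_0 = n_0(\lambda, \eta, \varphi, \psi, \varepsilon_0, c_\varphi, c_\psi)$ and where, in (3.4), the constant $c_\ell$ must be chosen
large enough so that $c_\ell^2 \ge \mu\gamma/c_4$.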

Before giving the proof of the theorem, we comment on its meaning. First, note that
the main consequence of (3.2) is that $|\hat p_n(x) - p(x)|$ is of order $\varphi_n$ for any $x$ in the
neighborhood $\mathcal{D}(\eta)$. That is, $\hat p_n$ is a good pointwise estimator of $p$ in this neighborhood.
Equation (3.1) is of the same flavor as (3.2) but in a weaker form. It entails that for $x$
outside $\mathcal{D}(\eta)$, $\hat p_n(x)$ is a consistent estimator of $p(x)$ with rate of order $\psi_n$, which can be
as slow as $o((\log n)^{-1/2})$ since it does not appear in the rates (3.3) or (3.4). This confirms
the intuition that the density needs to be accurately estimated only in a neighborhood
of $\lambda$, whereas, outside this neighborhood, it is sufficient to know whether the density is
greater or less than $\lambda$. Finally, note that the constant $c_\ell$ can be constructed from an upper
bound on the parameter $\gamma$ (and not $\gamma$ itself), which will be available in the particular
context of Section 4. Therefore, $\hat\Gamma$ remains adaptive to the parameter $\gamma$.

As already mentioned in Section 2.2, we can see that the offset employed when the
performance is measured using pseudo-distance $d_H$ is negligible with respect to the one
employed with $d_\Delta$. Actually, a plug-in estimator without offset does the job just as well
when $d_H$ is employed.

The proof of the theorem relies on the following lemma. Its proof is postponed to
Section A.1 in the Appendix.

Lemma 3.1. Under the assumptions of Theorem 3.1, for any offset $\ell > 0$, the plug-in
estimator $\hat\Gamma_\ell$ with offset $\ell$ satisfies

$$\sup_{p \in \mathcal{P}} \mathbb{E}\big[d_H(\Gamma_p(\lambda), \hat\Gamma_\ell)\big] \le C (\varphi_n \vee \ell)^{1+\gamma}, \tag{3.5}$$

for some positive constant $C$.
We now turn to the proof of Theorem 3.1.

Proof of Theorem 3.1. Note first that (3.3) is a direct consequence of Lemma 3.1
applied with $\ell = \ell_n \le C\varphi_n$.

To prove (3.4), we apply Proposition 2.1, whose conditions are satisfied. Indeed, using
the Markov inequality, we get

$$Q\big(\{\hat p_n > \lambda + \ell_n\} \Delta \{p > \lambda\}\big) \le Q(\hat p_n > \lambda) + Q(p > \lambda) \le M + \lambda^{-1}$$

and we choose $L_Q = M + \lambda^{-1}$. Proposition 2.1 yields

$$\mathbb{E}\big[d_\Delta(\Gamma_p(\lambda), \hat\Gamma)\big] \le \mathbb{E}\, Q\big(\Gamma_p(\lambda) \Delta \hat\Gamma \cap \{p = \lambda\}\big) + C\big(\varphi_n \sqrt{\log n}\big)^{\gamma}, \tag{3.6}$$

where the second term on the right-hand side was controlled by successively applying
the Jensen inequality and Lemma 3.1 with $\ell = c_\ell \varphi_n \sqrt{\log n}$, as prescribed in (3.4). To
conclude the proof, it is sufficient to observe that by Fubini's theorem and assumption
(3.2), we have

$$\mathbb{E}\, Q\big(\Gamma_p(\lambda) \Delta \hat\Gamma \cap \{p = \lambda\}\big) \le c_3\, Q(p = \lambda)\, e^{-c_4 c_\ell^2 \log n} \le C n^{-\mu\gamma} \le C \varphi_n^{\gamma}.$$

Together with (3.6), this inequality yields (3.4), which concludes the proof. □
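
For readers who want the Jensen step spelled out, the following worked derivation is a sketch of how the second term in (3.6) can be obtained from Proposition 2.1 and Lemma 3.1. It assumes that $n$ is large enough for $c_\ell \sqrt{\log n} \ge 1$, so that $\varphi_n \vee \ell_n = \ell_n$; the constant $C$ changes from line to line.

```latex
% The map u -> u^{gamma/(1+gamma)} is concave on [0, infinity), so Jensen applies.
\begin{align*}
\mathbb{E}\Big[\big(d_H(\Gamma_p(\lambda), \hat\Gamma)\big)^{\gamma/(1+\gamma)}\Big]
  &\le \Big(\mathbb{E}\big[d_H(\Gamma_p(\lambda), \hat\Gamma)\big]\Big)^{\gamma/(1+\gamma)}
     && \text{(Jensen)} \\
  &\le \Big(C(\varphi_n \vee \ell_n)^{1+\gamma}\Big)^{\gamma/(1+\gamma)}
     && \text{(Lemma 3.1)} \\
  &= C\,\ell_n^{\gamma}
   = C\big(c_\ell\,\varphi_n\sqrt{\log n}\big)^{\gamma}
   \le C\big(\varphi_n\sqrt{\log n}\big)^{\gamma}.
\end{align*}
```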

In the next section, we verify that kernel density estimators are $(\varphi, \psi)$-locally pointwise
convergent in a neighborhood of $\lambda$, uniformly over a class of locally Hölder smooth
probability densities.

4. Optimal rates for plug-in estimators with offset 
based on kernel density estimators 

In the remainder of this paper, we fix the measure $Q$ to be the Lebesgue measure on $\mathbb{R}^d$,
denoted by $\mathrm{Leb}_d$.

In this section, we derive exponential inequalities of type (3.1) when the estimator $\hat p_n$
is a kernel density estimator and the density $p$ belongs to some Hölder class of densities.
We begin by giving the definition of the Hölder classes of densities that we consider.

4.1. Hölder classes of densities

Fix $\beta > 0$ and $\lambda > 0$. For any $d$-tuple $s = (s_1, \dots, s_d) \in \mathbb{N}^d$ and $x = (x_1, \dots, x_d) \in \mathcal{X}$, we
define $|s| = s_1 + \dots + s_d$, $s! = s_1! \cdots s_d!$ and $x^s = x_1^{s_1} \cdots x_d^{s_d}$. Let $D^s$ denote the differential
operator

$$D^s = \frac{\partial^{s_1 + \dots + s_d}}{\partial x_1^{s_1} \cdots \partial x_d^{s_d}}.$$

For any real-valued function $g$ on $\mathcal{X}$ that is $\lfloor\beta\rfloor$-times continuously differentiable at point
$x_0 \in \mathcal{X}$, we denote by $g_{x_0}^{(\beta)}$ its Taylor polynomial of degree $\lfloor\beta\rfloor$ at point $x_0$:

$$g_{x_0}^{(\beta)}(x) = \sum_{|s| \le \lfloor\beta\rfloor} \frac{(x - x_0)^s}{s!} D^s g(x_0).$$

Fix $L > 0$, $r > 0$ and denote by $\Sigma(\beta, L, r, x_0)$ the set of functions $g : \mathcal{X} \to \mathbb{R}$ that are
$\lfloor\beta\rfloor$-times continuously differentiable at point $x_0$ and satisfy

$$|g(x) - g_{x_0}^{(\beta)}(x)| \le L \|x - x_0\|^{\beta} \qquad \forall x \in B(x_0, r).$$

The set $\Sigma(\beta, L, r, x_0)$ is called the $(\beta, L, r, x_0)$-locally Hölder class of functions. We now
define the class of densities that are considered in this paper.

Definition 4.1. Fix $\beta > 0$, $L > 0$, $r > 0$, $\lambda > 0$ and $\gamma > 0$. Recall that $\mathcal{D}(\eta)$ is the
neighborhood defined by

$$\mathcal{D}(\eta) = \{p \in (\lambda - \eta, \lambda + \eta)\}, \qquad \eta > 0.$$

Let $\mathcal{P}_\Sigma(\beta, L, r, \lambda, \gamma, \beta', L^*)$ denote the class of all probability densities $p$ on $\mathcal{X}$ for which
there exists $\eta > 0$ such that:

(i) $p \in \Sigma(\beta, L, r, x_0)$ for all $x_0 \in \mathcal{D}(\eta)$, apart from a set of null Lebesgue measure;

(ii) $\exists \beta' > 0$ such that $p \in \Sigma(\beta', L, r, x_0)$ for all $x_0 \notin \mathcal{D}(\eta)$, apart from a set of null
Lebesgue measure;

(iii) $p$ has $\gamma$-exponent at level $\lambda$ with respect to the Lebesgue measure;

(iv) $p$ is uniformly bounded by a constant $L^*$.

We will often prefer the compact notation $\mathcal{P}_\Sigma(\beta, \lambda, \gamma)$, or simply $\mathcal{P}_\Sigma$, when either the
parameters are clear from the context or their value does not affect the results.

The class $\mathcal{P}_\Sigma(\beta, \lambda, \gamma)$ is the class of uniformly bounded (iv) densities that have
$\gamma$-exponent at level $\lambda$ with respect to $\mathrm{Leb}_d$ (iii) and that are smooth in the neighborhood
of the level under consideration (i). As usual in nonparametric estimation, the parameters
$L$, $L^*$ and $r$ will affect only the constants in the rates of convergence presented below.
However, the smoothness parameter $\beta'$ in condition (ii), which might be expected to control the
rate of convergence, will also affect only the constants and therefore does not appear in
the compact notation $\mathcal{P}_\Sigma(\beta, \lambda, \gamma)$. Indeed, $\beta' > 0$ can be arbitrarily close to $0$ and this will
not affect the rates of convergence. Actually, the role of condition (ii) is to ensure that
any density from the class can be consistently estimated at any point with an arbitrarily
slow polynomial rate.

The class of densities $\mathcal{P}_\Sigma$ is similar to the class of regression functions considered
in Audibert and Tsybakov (2007). However, besides the additional assumption that
functions in $\mathcal{P}_\Sigma$ are probability densities, the main improvement here is that the regularity of
a density in $\mathcal{P}_\Sigma$ can be arbitrarily low outside a neighborhood of the level under
consideration, yielding slower rates of pointwise estimation. We prove below (cf. Corollary 4.1)
that optimal rates of convergence for DLSE are possible for this larger class of densities,
which corroborates the idea that the density need not be precisely estimated far from
the level $\lambda$.

The next proposition can be derived by following the lines of the proof of
Proposition 3.4 (fourth item) of Audibert and Tsybakov (2005).

Proposition 4.1. If $\gamma(\beta \wedge 1) > 1$, either $\Gamma$ has empty interior or its complement $\Gamma^c$
does. Conversely, if $\gamma(\beta \wedge 1) \le 1$, then there exist densities such that both $\Gamma$ and $\Gamma^c$ have
non-empty interior.

4.2. Exponential inequalities for kernel density estimators 

To estimate a density $p$ from the class $\mathcal{P}_\Sigma(\beta, \lambda, \gamma)$, we can use a kernel density estimator
defined by

$$\hat p_n(x) = \hat p_{n,h}(x) = \frac{1}{n h^d} \sum_{i=1}^n K\Big(\frac{X_i - x}{h}\Big), \tag{4.1}$$

where $h > 0$ is the bandwidth parameter and $K : \mathcal{X} \to \mathbb{R}$ is a kernel. This choice is not the
only possible one and all we need is an estimator that satisfies exponential inequalities as
in (3.1) and (3.2). The following lemma states that it is possible to derive such exponential
inequalities for a kernel density estimator with a $\beta^*$-valid kernel where $\beta^* > 0$. The
definition of a $\beta$-valid kernel is recalled in the Appendix, Definition A.1 (see also Tsybakov
(2004), for example).

Lemma 4.1. Let $P$ be a distribution on $\mathbb{R}^d$ having a density $p$ with respect to the
Lebesgue measure and such that $\|p\|_\infty \le L^*$ for some constant $L^* > 0$. Fix $\beta > 0$, $\beta^* \ge \beta$,
$L > 0$, $r > 0$ and assume that $p \in \Sigma(\beta, L, r, x_0)$. Let $\hat p_n$ be a kernel density estimator with
bandwidth $h > 0$ and $\beta^*$-valid kernel $K$, given an i.i.d. sample $X_1, \dots, X_n$ from $P$. Set

$$\Delta = \frac{6 L^* \|K\|^2}{\|K\|_\infty + L^* + L \int \|t\|^{\beta} K(t)\, dt}.$$

Then, for all $\delta, h < r$ such that $\Delta > \delta > 2 L c_5 h^{\beta} > 0$, we have

$$\mathbb{P}\big\{|\hat p_n(x_0) - p(x_0)| \ge \delta\big\} \le 2 \exp(-c_6 n h^d \delta^2),$$

where $c_5 = \int \|t\|^{\beta} K(t)\, dt$ and $c_6 = 1/(16 L^* \|K\|^2)$.

The proof is given in Section A.2 of the Appendix.

We can therefore apply Theorem 3.1. For the appropriate choice of $h$, it yields the
following corollary.

Corollary 4.1. Let $Q$ be the Lebesgue measure on $\mathbb{R}^d$. Fix positive constants $\beta, L, r, \lambda, \gamma, \beta', L^*$
and assume that $\gamma(\beta \wedge 1) \le 1$. Consider the class of densities $\mathcal{P}_\Sigma = \mathcal{P}_\Sigma(\beta, L, r, \lambda, \gamma, \beta', L^*)$
and define $\beta^* = \beta \vee \beta'$.
Let $\hat\Gamma$ be the plug-in estimator with offset $\ell_n$ based on the estimator $\hat p_n$ defined in (4.1)
with $\beta^*$-valid kernel $K$ and bandwidth parameter $h_n > 0$. Then, for any $c_h > 0$ and for
$c_\ell \ge \max((c_6 c_h^d)^{-1}, 1)$, we have

$$\sup_{p \in \mathcal{P}_\Sigma} \mathbb{E}\big[d_H(\Gamma_p(\lambda), \hat\Gamma)\big] \le C n^{-(1+\gamma)\beta/(2\beta+d)} \quad \text{with } \ell_n \le C n^{-\beta/(2\beta+d)},\ h_n = c_h n^{-1/(2\beta+d)};$$

$$\sup_{p \in \mathcal{P}_\Sigma} \mathbb{E}\big[d_\Delta(\Gamma_p(\lambda), \hat\Gamma)\big] \le C \Big(\frac{n}{\log n}\Big)^{-\gamma\beta/(2\beta+d)} \quad \text{with } \ell_n = c_\ell \Big(\frac{n}{\log n}\Big)^{-\beta/(2\beta+d)},\ h_n = c_h \Big(\frac{n}{\log n}\Big)^{-1/(2\beta+d)}.$$
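
To get a feel for the orders of magnitude in Corollary 4.1, the sketch below evaluates the bandwidths and the two rates for given $n$, $\beta$, $d$ and $\gamma$. All multiplicative constants ($C$, $c_h$) are set to 1, which is an assumption made purely for illustration; the function name `plug_in_rates` is hypothetical.

```python
import math

def plug_in_rates(n, beta, d, gamma):
    """Bandwidths and rates of Corollary 4.1 with every constant set to 1 (assumption)."""
    expo = 1.0 / (2.0 * beta + d)
    n_log = n / math.log(n)
    return {
        "h_dH": n ** (-expo),                            # bandwidth for the d_H bound
        "rate_dH": n ** (-(1.0 + gamma) * beta * expo),  # n^{-(1+gamma) beta / (2 beta + d)}
        "h_dDelta": n_log ** (-expo),                    # bandwidth for the d_Delta bound
        "rate_dDelta": n_log ** (-gamma * beta * expo),  # (n / log n)^{-gamma beta / (2 beta + d)}
    }

# Example with gamma * min(beta, 1) <= 1, as required by the corollary.
rates = plug_in_rates(n=10**5, beta=2.0, d=1, gamma=0.5)
for name, value in rates.items():
    print(f"{name:12s} {value:.6f}")
```

The $d_H$ rate carries the extra factor $1 + \gamma$ in the exponent, so it decays faster than the $d_\Delta$ rate for the same $n$.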



Proof. Define $\varphi_n = (n h_n^d)^{-1/2}$ and $\psi_n = h_n^{\beta'} \ge C h_n^{\beta}$ and consider separately the cases
$h_n = c_h n^{-1/(2\beta+d)}$ and $h_n = c_h (n/\log n)^{-1/(2\beta+d)}$. We have, respectively,

$$\varphi_n = c_\varphi n^{-\beta/(2\beta+d)} \le C h_n^{\beta} \quad \text{and} \quad \varphi_n = C n^{-\beta/(2\beta+d)} (\log n)^{-(d/2)/(2\beta+d)} \le C h_n^{\beta}.$$

Therefore, in both cases, $\psi_n \ge C h_n^{\beta} \ge C (n h_n^d)^{-1/2}$ and a direct consequence of Lemma 4.1
is that the kernel density estimator with $\beta^*$-valid kernel $K$ and bandwidth parameter
$h_n > 0$ is $(\varphi, \psi)$-locally pointwise convergent in $\mathcal{D}(\eta)$, uniformly over $\mathcal{P}_\Sigma$ in both cases.

We also need to check that for such an estimator we have $\mathrm{Leb}_d(\hat p_n > \lambda) \le M$, almost
surely for some $M > 0$. Note that since $K \in L_1(\mathbb{R}^d)$, we have

$$\infty > \int |K(x)|\, d\mathrm{Leb}_d(x) \ge \int |\hat p_n(x)|\, d\mathrm{Leb}_d(x) \ge \lambda\, \mathrm{Leb}_d\{\hat p_n > \lambda\}.$$

Hence, the condition is satisfied with $M = \lambda^{-1} \int |K|$.

Let us finally check that the aforementioned choice of $c_\ell$ is compatible with the
assumptions of Theorem 3.1. Since $\varphi_n = c_\varphi n^{-\beta/(2\beta+d)}$, we can take $\mu = \beta/(2\beta + d)$. We
need to check that $c_\ell^2 \ge \mu\gamma/(c_6 c_h^d)$. Since $c_\ell \ge 1$, we have $c_\ell^2 \ge c_\ell$ and

$$\frac{\mu\gamma}{c_6 c_h^d} \le \frac{\beta}{(2\beta + d)(\beta \wedge 1) c_6 c_h^d} = \frac{\beta \vee 1}{(2\beta + d)(c_6 c_h^d)} \le \frac{1}{c_6 c_h^d} \le c_\ell.$$

Since all the conditions of Theorem 3.1 are satisfied, this concludes the proof. □



5. Minimax lower bounds 

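The following theorem shows that the rates obtained in Corollary 4.1 are optimal in a
minimax sense.

Theorem 5.1. Let the underlying measure $Q$ be the Lebesgue measure on $\mathbb{R}^d$. Fix $\lambda > 0$,
let $\beta, \gamma$ be positive constants such that $\gamma\beta \le 1$ and consider the class of densities $\mathcal{P}_\Sigma =
\mathcal{P}_\Sigma(\beta, \lambda, \gamma)$. Then, for any $n \ge 1$ and any estimator $\hat G_n$ of $\Gamma_p(\lambda)$ constructed from the
sample $X_1, \dots, X_n$, we have

$$\sup_{p \in \mathcal{P}_\Sigma} \mathbb{E}\big[d_H(\Gamma_p(\lambda), \hat G_n)\big] \ge C n^{-(1+\gamma)\beta/(2\beta+d)}, \tag{5.1}$$

$$\sup_{p \in \mathcal{P}_\Sigma} \mathbb{E}\big[d_\Delta(\Gamma_p(\lambda), \hat G_n)\big] \ge C \Big(\frac{n}{\log n}\Big)^{-\gamma\beta/(2\beta+d)}. \tag{5.2}$$

Proof. Fix $n \ge 2$ and consider the quantities

$$\varrho_n = C n^{-\gamma\beta/(2\beta+d)} \quad \text{and} \quad \varkappa_n = C \Big(\frac{n}{\log n}\Big)^{-\gamma\beta/(2\beta+d)}.$$

Our goal is to find two families of densities $\mathcal{N}_\varrho$ and $\mathcal{N}_\varkappa$ that are in $\mathcal{P}_\Sigma$ and which satisfy
the conditions of Lemma A.2. Note first that although it does not appear in its notation,
the pseudo-distance $d_H$ depends on the underlying density $p$ and this will be inconvenient
in proving minimax lower bounds. As a result, we are going to prove (5.1) for a
pseudo-distance $d_\varrho$ which does not depend on $p$ and, for any measurable $G \subset [0, 1]^d$, satisfies

$$d_H(\Gamma_p(\lambda), G) \ge C \big(d_\varrho(\Gamma_p(\lambda), G)\big)^{(1+\gamma)/\gamma} \quad \text{for any } p \in \mathcal{N}_\varrho. \tag{5.3}$$

The pseudo-distance $d_\varrho$ will be defined in (5.6). We will use Lemma A.2 for $\mathcal{P} = \mathcal{P}_\Sigma$, with
$\varepsilon = \varrho_n$, $\mathcal{N} = \mathcal{N}_\varrho$, $d = d_\varrho$ to prove (5.1), and with $\varepsilon = \varkappa_n$, $\mathcal{N} = \mathcal{N}_\varkappa$, $d = d_\Delta$ to prove (5.2).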

For both families $\mathcal{N}_\varrho$ and $\mathcal{N}_\varkappa$, the construction begins as follows. Assume, without loss
of generality, that $\lambda = 1$. Let $q > 4$ be an integer, to be specified later, and consider the
regular grid $\mathcal{G}$ on $[0, 1]^d$ defined as

$$\mathcal{G} = \Big\{ \Big(\frac{2k_1 + 1}{2q}, \dots, \frac{2k_d + 1}{2q}\Big),\ k_i \in \{0, \dots, q - 1\},\ i = 1, \dots, d \Big\}.$$

Let $N$ denote the unique integer in the pair $\{q^d/2, (q^d - 1)/2\}$ and denote by $\{g_j\}_{1 \le j \le 2N}$
a collection of $2N$ distinct elements of the grid, the choice of indexing being of no
importance for what follows. For any $j = 1, \dots, 2N$, define the Euclidean balls $B_j = B(g_j, \kappa)$,
where $\kappa = 1/q$.
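
The grid construction can be sketched as follows. The values `q = 4` and `d = 2` are illustrative assumptions (the text takes $q > 4$, to be specified later), and taking the first $2N$ grid points is one arbitrary indexing among many, since the choice of indexing does not matter.

```python
from itertools import product

def regular_grid(q, d):
    """Points ((2*k_1 + 1)/(2q), ..., (2*k_d + 1)/(2q)) with each k_i in {0, ..., q-1}."""
    return [tuple((2 * k + 1) / (2 * q) for k in ks)
            for ks in product(range(q), repeat=d)]

q, d = 4, 2                 # illustrative values (assumption)
grid = regular_grid(q, d)
N = len(grid) // 2          # the integer among q**d / 2 and (q**d - 1) / 2
centers = grid[:2 * N]      # any 2N distinct grid points; the indexing is arbitrary

print(f"{len(grid)} grid points, N = {N}, first center = {centers[0]}")
```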

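Let $\phi_\beta : \mathbb{R}^d \to \mathbb{R}_+$ be a smooth function defined as follows. If $\beta \le 1$, the function $\phi_\beta$ is defined as

\[ \phi_\beta(x) = \begin{cases} C_\beta(1 - \|x\|)^\beta, & \text{if } 0 \le \|x\| \le 1, \\ 0, & \text{if } \|x\| > 1. \end{cases} \]

If $\beta > 1$, the function $\phi_\beta$ is defined as

\[ \phi_\beta(x) = \begin{cases} C_\beta(2^{1-\beta} - \|x\|^\beta), & \text{if } 0 \le \|x\| \le 1/2, \\ C_\beta(1 - \|x\|)^\beta, & \text{if } 1/2 < \|x\| \le 1, \\ 0, & \text{if } \|x\| > 1, \end{cases} \]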

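where, in both cases, $0 < C_\beta < 1/2$ is chosen small enough to ensure that $|\phi_\beta(x) - \phi_\beta(x')| \le L\|x - x'\|^\beta$ for any $x, x' \in \mathbb{R}^d$.
Then, for any $\omega = (\omega_1, \ldots, \omega_N) \in \{-1, 0, 1\}^N$, define on $[0,1]^d$ the function

\[ p_\omega(x) = 1 + \sum_{j=1}^N \omega_j\,[\varphi_j(x) - \varphi_{N+j}(x)], \]

where $\varphi_j(x) = \kappa^\beta \phi_\beta([x - g_j]/\kappa)\,\mathbf{1}\{x \in B_j\}$.

Define the integer $m = \lfloor q^{d-\gamma\beta}/2 \rfloor + 6$ so that $m$ satisfies $6 \le m \le 6N$.

Let $\Omega_\varrho$ and $\Omega_\varkappa$ be two subsets of $\{-1, 0, 1\}^N$ such that $\sum_j |\omega_j| \le m$ for any $\omega = (\omega_1, \ldots, \omega_N) \in \Omega_\varrho \cup \Omega_\varkappa$. Now define the families $\mathcal{N}_\varrho$ and $\mathcal{N}_\varkappa$ as

\[ \mathcal{N}_\varrho = \{p_\omega,\ \omega \in \Omega_\varrho\}, \qquad \mathcal{N}_\varkappa = \{p_\omega,\ \omega \in \Omega_\varkappa\}. \]

The sets $\Omega_\varrho$ and $\Omega_\varkappa$ will be chosen in order to fulfill the conditions of Lemma A.2.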

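First condition. $\mathcal{N} \subset \mathcal{P}_\Sigma(\beta, 1, \gamma)$.

First, note that for any $\omega \in \{-1, 0, 1\}^N$, $\|p_\omega\|_\infty \le 2$ and $p_\omega \in \Sigma(\beta, L, r, x)$ for any $x \in [0,1]^d$. Therefore, it remains to check that $p_\omega$ has $\gamma$-exponent at level 1 with respect to the Lebesgue measure. We now show that it is sufficient to have $\sum_j |\omega_j| \le 2m$ for this condition to hold, which is satisfied for $p_\omega$ either in $\mathcal{N}_\varrho$ or in $\mathcal{N}_\varkappa$. We have

\begin{align*}
\mathrm{Leb}_d\{x : 0 < |p_\omega(x) - 1| \le \varepsilon\}
 &= 2\sum_{j=1}^N \mathbf{1}\{|\omega_j| = 1\}\,\mathrm{Leb}_d\{x : 0 < |p_\omega(x) - 1| \le \varepsilon,\ x \in B_j\} \\
 &\le 4m\,\mathrm{Leb}_d\{x \in B_1 : 0 < \phi_\beta([x - g_1]/\kappa) \le \varepsilon\kappa^{-\beta}\} \\
 &= 4m \int_{B(0,\kappa)} \mathbf{1}\{\phi_\beta(x/\kappa) \le \varepsilon\kappa^{-\beta}\}\,dx.
\end{align*}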

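The last term in the previous system of equations is treated differently depending on whether $\beta \le 1$ or $\beta > 1$. If $\beta \le 1$, we have

\begin{align*}
4m \int_{B(0,\kappa)} \mathbf{1}\{\phi_\beta(x/\kappa) \le \varepsilon\kappa^{-\beta}\}\,dx
 &\le C\big(m\kappa^d\,\mathbf{1}\{\varepsilon > \kappa^\beta\} + m[\kappa^d - (\kappa - \varepsilon^{1/\beta})^d]\,\mathbf{1}\{\varepsilon \le \kappa^\beta\}\big) \\
 &\le C\big(\kappa^{\gamma\beta}\,\mathbf{1}\{\varepsilon > \kappa^\beta\} + m\kappa^{d-1}\varepsilon^{1/\beta}\,\mathbf{1}\{\kappa \ge \varepsilon^{1/\beta}\}\big) \tag{5.4} \\
 &\le C\big(\varepsilon^\gamma + \kappa^{\gamma\beta-1}\varepsilon^{1/\beta}\,\mathbf{1}\{\kappa \ge \varepsilon^{1/\beta}\}\big) \\
 &\le C\varepsilon^\gamma,
\end{align*}

where we have used the fact that $\gamma\beta - 1 \le 0$ to bound the second term in the penultimate inequality.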
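The last two steps of (5.4) can be spelled out as follows. This is a short worked computation, assuming only that $\kappa = 1/q$, that $m \le C\kappa^{-(d-\gamma\beta)}$ (which follows from the definition of $m$ since $d - \gamma\beta \ge 0$) and that $\gamma\beta \le 1$.

```latex
% Assumptions: kappa = 1/q, m <= C kappa^{-(d - gamma*beta)}, gamma*beta <= 1.
\begin{align*}
\kappa < \varepsilon^{1/\beta}
  &\;\Longrightarrow\;
  m\kappa^{d} \le C\kappa^{\gamma\beta} \le C\varepsilon^{\gamma}, \\
\kappa \ge \varepsilon^{1/\beta}
  &\;\Longrightarrow\;
  m\kappa^{d-1}\varepsilon^{1/\beta}
  \le C\kappa^{\gamma\beta-1}\varepsilon^{1/\beta}
  \le C\varepsilon^{(\gamma\beta-1)/\beta}\,\varepsilon^{1/\beta}
  = C\varepsilon^{\gamma}.
\end{align*}
```

In the second case, the exponent $\gamma\beta - 1$ is nonpositive, so $\kappa \mapsto \kappa^{\gamma\beta-1}$ is nonincreasing and the lower bound on $\kappa$ turns into an upper bound on $\kappa^{\gamma\beta-1}$.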

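We now treat the case $\beta > 1$. Note that integration over $x$ such that $\|x\| \ge \kappa/2$ can be treated in the same manner as in the case $\beta \le 1$. The integral over $x$ such that $\|x\| \le \kappa/2$ is trivially upper bounded by a term proportional to the volume of the ball $B(0, \kappa/2)$. It yields

\begin{align*}
4m \int_{B(0,\kappa)} \mathbf{1}\{\phi_\beta(x/\kappa) \le \varepsilon\kappa^{-\beta}\}\,dx
 &\le C\big(m\kappa^d\,\mathbf{1}\{\varepsilon > C_\beta(\kappa/2)^\beta\} + m\kappa^{d-1}\varepsilon^{1/\beta}\,\mathbf{1}\{\varepsilon \le C_\beta(\kappa/2)^\beta\}\big) \\
 &\le C\big(\kappa^{\gamma\beta}\,\mathbf{1}\{\kappa < 2(\varepsilon/C_\beta)^{1/\beta}\} + m\kappa^{d-1}\varepsilon^{1/\beta}\,\mathbf{1}\{\kappa \ge 2(\varepsilon/C_\beta)^{1/\beta}\}\big) \tag{5.5} \\
 &\le C\varepsilon^\gamma.
\end{align*}

As a result, both $\mathcal{N}_\varrho$ and $\mathcal{N}_\varkappa$ are subsets of $\mathcal{P}_\Sigma(\beta, 1, \gamma)$ and the first condition is satisfied.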

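Second condition. $d(\Gamma_p, \Gamma_q) \ge 2\varepsilon$ for all $p, q \in \mathcal{N}$, $p \ne q$.

As will become clear below, we need to bound from below the Hamming distance between $\omega$ and $\omega'$, defined for any $\omega, \omega' \in \{-1, 0, 1\}^N$ by

\[ \rho(\omega, \omega') = \sum_{j=1}^N \mathbf{1}\{\omega_j \ne \omega'_j\}. \]

Let us now treat separately the class $\mathcal{N}_\varrho$ and the class $\mathcal{N}_\varkappa$.

We begin with $\mathcal{N}_\varrho$ and define $\ell = \lfloor m/6 \rfloor$ so that $m \ge 6\ell \ge 6$. The subset $\Omega_\varrho$ is chosen to be of the form

\[ \Omega_\varrho = \{\omega = (\omega^{(m)}, 0, \ldots, 0),\ \omega^{(m)} \in \Omega_\varrho^{(m)}\}, \]

where $\Omega_\varrho^{(m)}$ is a subset of $\{-1, 1\}^m$. For any $\omega \in \Omega_\varrho$, we clearly have

\[ \sum_{j=1}^N |\omega_j| = m \le 2m. \]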

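We are now in a position to define $d_\varrho$ by

\[ d_\varrho(G_1, G_2) = \sum_{j=1}^m \mathrm{Leb}_d\big((G_1 \triangle G_2) \cap (B_j \cup B_{N+j})\big). \tag{5.6} \]

It is easy to check that (5.3) is satisfied since for any $p \in \mathcal{N}_\varrho$ and any measurable $G \subset [0,1]^d$, we have

\[ d_\varrho(\Gamma_p, G) = \mathrm{Leb}_d\big((\Gamma_p \triangle G) \cap \{p \ne \lambda\}\big) \le C\big(d_H(\Gamma_p(\lambda), G)\big)^{\gamma/(1+\gamma)}, \]

where in the last inequality, we have used Proposition 2.1.
The set $\Omega_\varrho^{(m)}$ is extracted from $\{-1, 1\}^m$ using Lemma A.1, which guarantees that there exists such a set with cardinality $s_\varrho \ge 2$ satisfying

\[ \log(s_\varrho) \ge C\ell\log(m/\ell) \ge Cm \]

and

\[ \sum_{j=1}^m \mathbf{1}\{\omega_j \ne \omega'_j\} \ge \ell + 1 \ge m/6 \]

for any two distinct elements $\omega, \omega'$ of this set. It yields

\[ d_\varrho(\Gamma_{p_\omega}, \Gamma_{p_{\omega'}}) = 2\,\mathrm{Leb}_d(B_1)\,\rho(\omega, \omega') \ge Cm\kappa^d \ge 2\varrho_n \]

when $q = 4\lfloor n^{1/(2\beta+d)} \rfloor$.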

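We now define the set $\mathcal{N}_\varkappa$ as follows. Let $\Omega_\varkappa$ be a subset of $\{0, 1\}^N$ with cardinality $s_\varkappa \ge 2$, extracted using Lemma A.1 such that

\[ \sum_{j=1}^N \omega_j = m \]

for any $\omega$ in this extracted subset and to which $0 \in \{0, 1\}^N$ has been added. It satisfies

\[ \log(s_\varkappa) \ge Cm\log(N/m) \ge Cm\log(q) \]

and

\[ \sum_{j=1}^N \mathbf{1}\{\omega_j \ne \omega'_j\} \ge m + 1. \]

It yields

\[ d_\Delta(\Gamma_{p_\omega}, \Gamma_{p_{\omega'}}) = 2\,\mathrm{Leb}_d(B_1)\sum_{j=1}^N \mathbf{1}\{\omega_j \ne \omega'_j\} \ge C\kappa^d m \ge 2\varkappa_n \]

when $q = 4\lfloor (n/\log n)^{1/(2\beta+d)} \rfloor$.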
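To see how the two choices of $q$ produce the separation rates, the sketch below computes $q$, $\kappa = 1/q$ and $m$ for a given sample size and compares $m\kappa^d$ with the target rate. The function name `construction_parameters` and the numerical values are illustrative assumptions. The comparison uses only the elementary facts $m \ge q^{d-\gamma\beta}/2$ and $q \le 4n^{1/(2\beta+d)}$, with $n$ replaced by $n/\log n$ in the second case.

```python
import math


def construction_parameters(n, beta, gamma, d, log_factor=False):
    """Return (q, kappa, m) for sample size n >= 2.

    log_factor=False: q = 4 * floor(n^{1/(2*beta+d)})
    log_factor=True:  q = 4 * floor((n / log n)^{1/(2*beta+d)})
    """
    base = n / math.log(n) if log_factor else float(n)
    q = 4 * math.floor(base ** (1.0 / (2 * beta + d)))
    kappa = 1.0 / q
    m = math.floor(q ** (d - gamma * beta) / 2) + 6
    return q, kappa, m


n, beta, gamma, d = 10_000, 1.0, 0.5, 2
exponent = gamma * beta / (2 * beta + d)

# Family used for (5.1): separation of order n^{-gamma*beta/(2*beta+d)}.
q1, kappa1, m1 = construction_parameters(n, beta, gamma, d)
separation1 = m1 * kappa1 ** d
target1 = n ** (-exponent)

# Family used for (5.2): same computation with n replaced by n / log n.
q2, kappa2, m2 = construction_parameters(n, beta, gamma, d, log_factor=True)
separation2 = m2 * kappa2 ** d
target2 = (n / math.log(n)) ** (-exponent)
```

In both cases `separation` is at least `0.5 * 4 ** (-gamma * beta)` times `target`, which is the sense in which $m\kappa^d \ge C\varrho_n$ and $m\kappa^d \ge C\varkappa_n$.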

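Third condition. $\max_{\omega \in \Omega} K(p_\omega, p_{\omega_0}) \le C\log(\mathrm{card}(\mathcal{N}))$.

Note that for the above choice of $\Omega_\varrho$, we have $\log(\mathrm{card}(\mathcal{N}_\varrho)) \ge Cm$ and need only prove that

\[ \max_{\omega, \omega' \in \Omega_\varrho} K(p_\omega, p_{\omega'}) \le Cm. \]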



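Define $\xi_j(x) = \varphi_j(x) - \varphi_{j+N}(x)$. For any $p_\omega, p_{\omega'} \in \mathcal{N}_\varrho$, using the inequality

\[ \log\Big(\frac{1+a}{1+b}\Big)(1+a) \le (a - b) + 2(a - b)^2 \]

for any $a, b$ such that $|a| \le 1$, $|b| \le 1/2$, we have, for $q = 4\lfloor n^{1/(2\beta+d)} \rfloor$,

\begin{align*}
K(p_\omega, p_{\omega'})
 &= n\sum_{j=1}^m \int \log\Big(\frac{1 + \omega_j\xi_j(x)}{1 + \omega'_j\xi_j(x)}\Big)\big(1 + \omega_j\xi_j(x)\big)\,dx \\
 &\le n\sum_{j=1}^m \int 2\big[(\omega_j - \omega'_j)\xi_j(x)\big]^2\,dx \\
 &\le Cnm \int_{B_1} \varphi_1^2(x)\,dx \\
 &\le Cnm\kappa^{2\beta+d} \int_{B(0,1)} \phi_\beta^2(x)\,dx \\
 &\le Cm,
\end{align*}

where the linear term $(\omega_j - \omega'_j)\xi_j$ integrates to zero and, in the third line, we used the invariance by translation of the family $\{\varphi_j\}_j$.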
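For completeness, the elementary inequality used above can be checked in a few lines. The derivation below is a sketch for $-1 < a \le 1$ and $|b| \le 1/2$, writing $u = (a-b)/(1+b)$ and using only $\log(1+u) \le u$, $1 + a \ge 0$ and $1 + b \ge 1/2$.

```latex
\begin{align*}
\log\Big(\frac{1+a}{1+b}\Big)(1+a)
  &= (1+a)\log\Big(1 + \frac{a-b}{1+b}\Big)
   \le (1+a)\,\frac{a-b}{1+b} \\
  &= (a-b)\Big(1 + \frac{a-b}{1+b}\Big)
   = (a-b) + \frac{(a-b)^2}{1+b}
   \le (a-b) + 2(a-b)^2 .
\end{align*}
```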
We can therefore apply Lemma A.2 to prove that

\[ \sup_{p \in \mathcal{P}_\Sigma} E[d_\varrho(\Gamma_p(\lambda), \hat G_n)] \ge C\varrho_n. \]

This inequality combined with (5.3) and the Jensen inequality yields (5.1).
In the same manner, to prove (5.2), it is sufficient to prove that

\[ \max_{\omega \in \Omega_\varkappa} K(p_\omega, p_0) \le Cm\log q. \]
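This follows from the following sequence of inequalities:

\begin{align*}
K(p_\omega, p_0)
 &= n\sum_{j=1}^N \int_{B_j \cup B_{N+j}} \log\big(1 + \omega_j\xi_j(x)\big)\big(1 + \omega_j\xi_j(x)\big)\,dx \\
 &\le 2n\sum_{j=1}^N \omega_j^2 \int_{B_1} \varphi_1^2(x)\,dx \\
 &\le 2nm\kappa^{2\beta+d} \int_{B(0,1)} \phi_\beta^2(x)\,dx \\
 &\le Cm\log n \\
 &\le Cm\log q
\end{align*}

for $q = 4\lfloor (n/\log n)^{1/(2\beta+d)} \rfloor$, $m = C(n/\log n)^{(d-\gamma\beta)/(2\beta+d)}$ and where, in the first inequality, we have used the convexity inequality $\log(1+x) \le x$, together with the invariance by translation of the family $\{\varphi_j\}_j$. □

Appendix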

Several results that can be omitted in a first reading of the paper are collected in this 
appendix. 

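A.1. Proof of Lemma 3.1

To prove (3.5), we use the same scheme as in the proof of Audibert and Tsybakov (2007), Theorem 3.1. Recall that $\hat\Gamma$ denotes the plug-in estimator with offset $0 \le \ell \le c_\ell\varphi_n\sqrt{\log n}$ and that $\hat\Gamma \triangle \Gamma = (\hat\Gamma \cap \Gamma^c) \cup (\hat\Gamma^c \cap \Gamma)$. It yields

\[ E[d_H(\hat\Gamma, \Gamma)] = E\int_{\hat\Gamma \cap \Gamma^c} |p(x) - \lambda|\,dQ(x) + E\int_{\hat\Gamma^c \cap \Gamma} |p(x) - \lambda|\,dQ(x). \]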
Define two sequences 

a n = c a (ip n V£) and /?„ = cp (ip n V ip n ) ■\/log n, n>2, 

where c a = 2 max(c ¥ ,, c^,, 1) and cp > c a max(Q, 2/x(l + r y)/c2, 1). Let no be a positive 
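integer such that $\alpha_n \le \beta_n \le \eta \wedge \varepsilon_0 \wedge \Delta$ for all $n \ge n_0$. In the remainder of the proof, we always assume that $n \ge n_0$. Consider the following disjoint decomposition:

\[ \hat\Gamma^c \cap \Gamma = \{\hat p_n < \lambda + \ell,\ p > \lambda\} \subset A_1 \cup A_2 \cup A_3, \tag{A.1} \]

where

\begin{align*}
A_1 &= \{\hat p_n < \lambda + \ell,\ \lambda < p \le \lambda + \alpha_n\}, \\
A_2 &= \{\hat p_n < \lambda + \ell,\ \lambda + \alpha_n < p \le \lambda + \beta_n\}, \\
A_3 &= \{\hat p_n < \lambda + \ell,\ p > \lambda + \beta_n\}.
\end{align*}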

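Observe that $A_1 \subset \{0 < |p - \lambda| \le \alpha_n\}$. This yields

\[ E\int_{A_1} |p(x) - \lambda|\,dQ(x) \le \alpha_n Q(A_1) \le c_0(\alpha_n)^{1+\gamma}, \tag{A.2} \]

where, in the last inequality, we used the $\gamma$-exponent of $p$. Define $J_n = \lfloor \log_2(\beta_n/\alpha_n) \rfloor + 2$, where $\lfloor y \rfloor$ denotes the maximal integer that is strictly smaller than $y > 0$. We can then partition $A_2$ into

\[ A_2 = \bigcup_{j=1}^{J_n} \mathcal{X}_j \cap A_2, \]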

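where

\[ \mathcal{X}_j = \{\hat p_n < \lambda + \ell,\ \lambda + 2^{j-1}\alpha_n < p \le \lambda + 2^j\alpha_n\} \cap \mathcal{D}(\eta \wedge \varepsilon_0). \]

Hence,

\[ E\int_{A_2} |p(x) - \lambda|\,dQ(x) = \sum_{j=1}^{J_n} E\int_{\mathcal{X}_j \cap A_2} |p(x) - \lambda|\,dQ(x). \tag{A.3} \]

Now, since $\ell \le \alpha_n/2$, we have

\[ \mathcal{X}_j \subset \{|\hat p_n - p| > 2^{j-2}\alpha_n\} \cap \{|p(x) - \lambda| \le 2^j\alpha_n\}. \]

Using Fubini's theorem and the previous inclusion, the general term of the sum in the right-hand side of (A.3) can be bounded from above by

\[ 2^j\alpha_n \int P\big[|\hat p_n(x) - p(x)| > 2^{j-2}\alpha_n\big]\,\mathbf{1}\{0 < |p(x) - \lambda| \le 2^j\alpha_n\}\,dQ(x). \]

Note that for any $1 \le j \le J_n$, we have $c_\varphi\varphi_n \le 2^{j-2}\alpha_n \le \beta_n \le \Delta$. Now using (3.2) and the fact that $p$ has $\gamma$-exponent at level $\lambda$, we get

\[ E\int_{A_2} |p(x) - \lambda|\,dQ(x) \le C(\alpha_n)^{1+\gamma}, \tag{A.4} \]

where we have used the fact that $\varphi_n \le \alpha_n$.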

We now treat the integral over $A_3$ by first noting that $Q(A_3) \le 1/\lambda$, from the Markov inequality. Next, using Fubini's theorem and the fact that $\beta_n/2 \ge \ell$ and $\beta_n/2 \ge c_\varphi\varphi_n$, we obtain

e[ \p(x) - X\dQ(x) < [ \p{x)-\\F[\p n {x)-p(x)\>p n /2]dQ{x) 

JA 3 JA 3 

< 2d exp(-c 2 (/V(2^„)) 2 ) < 2 Cl n- C2C i/ 2 . 
Using the fact that c| > cp > 2/i(l + j)/c2, we get 

E f \p(x) - A| dQ(x) < 2cm-^ 1+ ^ < Ca^ ] . (A.5) 

J A 3 

In view of (A.1), if we combine (A.2), (A.4) and (A.5), we obtain

\[ E\int_{\hat\Gamma^c \cap \Gamma} |p(x) - \lambda|\,dQ(x) \le C\alpha_n^{1+\gamma}. \]

In the same manner, it can be shown that for $n \ge n_0$,

\[ E\int_{\hat\Gamma \cap \Gamma^c} |p(x) - \lambda|\,dQ(x) \le C\alpha_n^{1+\gamma}. \]

The only difference with the part of the proof detailed above is that in the step that corresponds to proving the equivalent of (A.5), we use the assumption that $Q(\hat p_n \ge \lambda) \le M$ almost surely, in place of the Markov inequality.



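A.2. Proof of Lemma 4.1

Proof. For any $x_0 \in \mathbb{R}^d$,

\[ |\hat p_n(x_0) - p(x_0)| = \Big|\frac{1}{n}\sum_{i=1}^n Z_i(x_0)\Big| \quad\text{with}\quad Z_i(x_0) = \frac{1}{h^d}K\Big(\frac{X_i - x_0}{h}\Big) - p(x_0). \]

The expectation of $Z_i(x_0)$ is the pointwise bias of a kernel density estimator with bandwidth $h$. Under the assumptions of the theorem, it is controlled in the following way:

\[ |E Z_i(x_0)| \le Lc_5h^\beta. \]

Indeed,

\begin{align*}
|E Z_i(x_0)|
 &= \Big|\frac{1}{h^d}\int K\Big(\frac{t}{h}\Big)\big[p(x_0 + t) - p(x_0)\big]\,dt\Big| \\
 &= \Big|\int K(t)\big[p(x_0 + ht) - p(x_0)\big]\,dt\Big| \\
 &\le \Big|\int K(t)\big[p(x_0 + ht) - p_{x_0}^{(\beta)}(x_0 + ht)\big]\,dt\Big|
   + \Big|\int K(t)\big[p_{x_0}^{(\beta)}(x_0 + ht) - p(x_0)\big]\,dt\Big|. \tag{A.6}
\end{align*}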



To control the first term in the right-hand side of (A.6), note that since $K$ has support $[-1,1]^d$, for any $h \le r/\sqrt{d}$, we have $x_0 + ht \in B(x_0, r)$ for any $t \in [-1,1]^d$. Thus, using the fact that $p$ is in $\Sigma(\beta, L, r, x_0)$, we have

\[ \Big|\int K(t)\big[p(x_0 + ht) - p_{x_0}^{(\beta)}(x_0 + ht)\big]\,dt\Big| \le L\int |K(t)|\,\|ht\|^\beta\,dt. \]

Now, since $K$ is a $\beta$-valid kernel (cf. Proposition A.2) and $p_{x_0}^{(\beta)} - p(x_0)$ is a polynomial of degree at most $\lfloor\beta\rfloor$ with no constant term, the second term in the right-hand side of (A.6) is zero. Therefore, we have

\[ |E Z_i(x_0)| \le Lh^\beta\int |K(t)|\,\|t\|^\beta\,dt \quad\text{for any } h \le r/\sqrt{d}. \]



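Now denote, for simplicity, $Z_i = Z_i(x_0)$ and let $\bar Z_i$ be the centered version of $Z_i$. When $Lc_5h^\beta \le \delta/2$, we then have

\[ P\{|\hat p_n(x_0) - p(x_0)| \ge \delta\}
 \le P\Big\{\Big|\frac{1}{n}\sum_{i=1}^n \bar Z_i\Big| \ge \delta - Lc_5h^\beta\Big\}
 \le P\Big\{\Big|\frac{1}{n}\sum_{i=1}^n \bar Z_i\Big| \ge \frac{\delta}{2}\Big\}. \]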



The right-hand side of the last inequality can be bounded by applying Bernstein's inequality (see Devroye et al. (1996), Theorem 8.4, page 124) to $\bar Z_i$ and $-\bar Z_i$ successively. For $h \le 1$, we have

\[ |\bar Z_i| \le \|K\|_\infty h^{-d} + L^* + Lc_5h^\beta \le c_7h^{-d}, \]

where $c_7 = \|K\|_\infty + L^* + Lc_5$ and

\[ \mathrm{Var}\{Z_i\} \le h^{-d}\int K(u)^2\,p(x_0 + hu)\,du \le c_8h^{-d}, \]

where $c_8 = L^*\|K\|_2^2$. Applying Bernstein's inequality now yields

\[ P\{|\hat p_n(x_0) - p(x_0)| \ge \delta\}
 \le 2\exp\Big(-\frac{n(\delta/2)^2}{2(c_8h^{-d} + c_7h^{-d}\delta/6)}\Big)
 \le 2\exp(-c_6nh^d\delta^2) \]

for any $\delta \le \Delta$ and where $\Delta = 6c_8/c_7$ and $c_6 = 1/(16c_8)$.

□
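The final simplification of the Bernstein exponent can be checked numerically. The sketch below is only an illustration: the function names and the numerical values of $n$, $h$, $d$, $c_7$ and $c_8$ are arbitrary assumptions. It compares the exponent produced by Bernstein's inequality with the simplified exponent $c_6nh^d\delta^2$ for several values of $\delta$ below $\Delta = 6c_8/c_7$.

```python
def bernstein_exponent(n, h, d, delta, c7, c8):
    """Exponent n*(delta/2)^2 / (2*(c8*h^-d + c7*h^-d*delta/6))."""
    t = delta / 2.0
    return n * t ** 2 / (2.0 * (c8 * h ** (-d) + c7 * h ** (-d) * delta / 6.0))


def simplified_exponent(n, h, d, delta, c8):
    """Simplified exponent c6*n*h^d*delta^2 with c6 = 1/(16*c8)."""
    return n * h ** d * delta ** 2 / (16.0 * c8)


def simplification_holds(n=500, h=0.2, d=2, c7=3.0, c8=1.5):
    """Check the simplification for delta = 0.1*Delta, ..., 0.9*Delta."""
    Delta = 6.0 * c8 / c7
    deltas = [Delta * k / 10.0 for k in range(1, 10)]
    return all(
        bernstein_exponent(n, h, d, dl, c7, c8) >= simplified_exponent(n, h, d, dl, c8)
        for dl in deltas
    )
```

The ratio of the two exponents equals $2c_8/(c_8 + c_7\delta/6)$, which is at least 1 exactly when $\delta \le 6c_8/c_7$; this is where the threshold $\Delta$ comes from.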



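A.3. Equivalent formulation for the $\gamma$-exponent condition

The following proposition gives an equivalent formulation for the $\gamma$-exponent condition.

Proposition A.1. Fix $\lambda > 0$, $\gamma > 0$ and $L_Q > 0$.

Define $\mathcal{L} = \mathcal{L}(\lambda) = \{p = \lambda\}$. The two following statements are equivalent:

(i) $\exists c > 0$ and $\varepsilon_0 > 0$ such that for any $0 < \varepsilon \le \varepsilon_0$, we have

\[ Q\{x \in \mathcal{X} : 0 < |p(x) - \lambda| \le \varepsilon\} \le c\varepsilon^\gamma; \]

(ii) $\exists c' > 0$ and $\varepsilon_1 > 0$ such that for any $0 < \varepsilon \le \varepsilon_1$, we have

\[ Q\{x \in \mathcal{X} : 0 < |p(x) - \lambda| \le \varepsilon\} \le L_Q, \]

and for all $G \subset \mathcal{X} \setminus \mathcal{L}$ satisfying $Q(G) \le L_Q$, we have

\[ Q(G) \le c'\Big(\int_G |p(x) - \lambda|\,dQ(x)\Big)^{\gamma/(1+\gamma)}. \tag{A.7} \]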

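Proof. The proof of (i) $\Rightarrow$ (ii) essentially follows that of Tsybakov (2004), Proposition 1. Define

\[ \varepsilon_1 = \varepsilon_0 \wedge \Big(\frac{L_Q}{c(1+\gamma)}\Big)^{1/\gamma}. \]

Observe that for any $0 < \varepsilon \le \varepsilon_1$, we have

\[ Q\{x \in \mathcal{X} : 0 < |p(x) - \lambda| \le \varepsilon\} \le c\varepsilon^\gamma \le c\varepsilon_1^\gamma \le \frac{L_Q}{1+\gamma} \le L_Q. \]

Define $A_\varepsilon = \{x : |p(x) - \lambda| > \varepsilon\}$ for all $0 < \varepsilon \le \varepsilon_0$. For any measurable set $G \subset \mathcal{X} \setminus \mathcal{L}$, we have

\begin{align*}
\int_G |p(x) - \lambda|\,dQ(x)
 &\ge \varepsilon Q(G \cap A_\varepsilon) \\
 &\ge \varepsilon\big[Q(G) - Q(A_\varepsilon^c \cap \mathcal{L}^c)\big] \\
 &\ge \varepsilon\big[Q(G) - \tilde c\varepsilon^\gamma\big] \quad \forall \tilde c \ge c,
\end{align*}

where the last inequality is obtained using (i). Maximizing the last term with respect to $\varepsilon > 0$, we get

\[ \Big(\int_G |p(x) - \lambda|\,dQ(x)\Big)^{\gamma/(1+\gamma)} \ge Q(G)\Big(\frac{\gamma}{1+\gamma}\Big)^{\gamma/(1+\gamma)}\Big(\frac{1}{1+\gamma}\Big)^{1/(1+\gamma)}\tilde c^{-1/(1+\gamma)}. \]

This yields (A.7) with $c' = (1+\gamma)\gamma^{-\gamma/(1+\gamma)}\tilde c^{1/(1+\gamma)}$. Note that the maximum is obtained for $\varepsilon = (Q(G)/(\tilde c(1+\gamma)))^{1/\gamma} \le \varepsilon_0$ for sufficiently large $\tilde c$ and (i) is valid for this particular $\varepsilon$.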
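The maximization step can be checked numerically. The sketch below uses hypothetical function names and arbitrary values of $Q(G)$, $\tilde c$ and $\gamma$. It evaluates $f(\varepsilon) = \varepsilon[Q(G) - \tilde c\varepsilon^\gamma]$ on a fine grid, compares the grid maximum with the value at the stationary point $\varepsilon^* = (Q(G)/(\tilde c(1+\gamma)))^{1/\gamma}$, and verifies that the constant $c'$ recovers $Q(G)$ from the maximal value.

```python
def lower_bound(eps, Q, c, gamma):
    """f(eps) = eps * (Q - c * eps**gamma)."""
    return eps * (Q - c * eps ** gamma)


def maximizer(Q, c, gamma):
    """Stationary point eps* = (Q / (c * (1 + gamma)))**(1 / gamma)."""
    return (Q / (c * (1.0 + gamma))) ** (1.0 / gamma)


def c_prime(c, gamma):
    """Constant c' = (1 + gamma) * gamma**(-gamma/(1+gamma)) * c**(1/(1+gamma))."""
    return (1.0 + gamma) * gamma ** (-gamma / (1.0 + gamma)) * c ** (1.0 / (1.0 + gamma))


Q, c, gamma = 0.3, 2.0, 1.5   # arbitrary illustrative values
eps_star = maximizer(Q, c, gamma)
f_star = lower_bound(eps_star, Q, c, gamma)

# Brute-force maximum of f over the grid {1e-4, 2e-4, ..., 1}.
grid_max = max(lower_bound(k / 10_000.0, Q, c, gamma) for k in range(1, 10_001))

# Inverting the bound at the maximizer gives back Q(G) exactly.
recovered_Q = c_prime(c, gamma) * f_star ** (gamma / (1.0 + gamma))
```

Since $f(\varepsilon^*)$ is only a lower bound on the integral of $|p - \lambda|$ over $G$, the identity recovered here becomes the inequality (A.7) in the proof.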

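We now prove that (ii) $\Rightarrow$ (i). Consider $\varepsilon_1 > 0$ such that $Q(A_\varepsilon^c \cap \mathcal{L}^c) \le L_Q$ for any $0 < \varepsilon \le \varepsilon_1$ and $c' > 0$ such that (A.7) is satisfied for any $G \subset \mathcal{X} \setminus \mathcal{L}$, $Q(G) \le L_Q$. Taking $G = A_\varepsilon^c \cap \mathcal{L}^c$ in (A.7) yields

\begin{align*}
Q\{x : 0 < |p(x) - \lambda| \le \varepsilon\}
 &= Q(A_\varepsilon^c \cap \mathcal{L}^c) \\
 &\le c'\Big(\int_{A_\varepsilon^c \cap \mathcal{L}^c} |p(x) - \lambda|\,dQ(x)\Big)^{\gamma/(1+\gamma)} \\
 &\le c'\big(\varepsilon Q(A_\varepsilon^c \cap \mathcal{L}^c)\big)^{\gamma/(1+\gamma)}.
\end{align*}

Therefore,

\[ Q\{x : 0 < |p(x) - \lambda| \le \varepsilon\} \le (c')^{1+\gamma}\varepsilon^\gamma. \]

This inequality yields (i) with $\varepsilon_0 = \varepsilon_1$ and $c = (c')^{1+\gamma}$. □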

A.4. On $\beta$-valid kernels

We recall here the definition of $\beta$-valid kernels and state a property that is useful in the present study.

Definition A.1. Let $K$ be a real-valued function on $\mathbb{R}^d$, with support $[-1,1]^d$. For fixed $\beta > 0$, the function $K(\cdot)$ is said to be a $\beta$-valid kernel if it satisfies $\int K = 1$, $\int |K|^p < \infty$ for any $p \ge 1$, $\int \|t\|^\beta |K(t)|\,dt < \infty$ and, in the case $\lfloor\beta\rfloor \ge 1$, it satisfies $\int t^sK(t)\,dt = 0$ for any $s = (s_1, \ldots, s_d) \in \mathbb{N}^d$ such that $1 \le s_1 + \cdots + s_d \le \lfloor\beta\rfloor$.

Example A.1. Let $\beta > 0$. For any $\beta$-valid kernel $K$ defined on $\mathbb{R}$, consider the product kernel

\[ \tilde K(x) = K(x_1)K(x_2)\cdots K(x_d)\,\mathbf{1}\{x \in [-1,1]^d\} \]

for any $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$. It can then be easily shown that $\tilde K$ is a $\beta$-valid kernel on $\mathbb{R}^d$. Now, for any $\beta > 0$, an example of a 1-dimensional $\beta$-valid kernel is given in Tsybakov (2009), Section 1.2.2, the construction of which is based on Legendre polynomials. This eventually proves the existence of a multivariate $\beta$-valid kernel for any given $\beta > 0$.
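The moment conditions of the product kernel can be verified exactly on a small example. The sketch below assumes the polynomial kernel $K(u) = 9/8 - (15/8)u^2$ on $[-1,1]$, chosen here only as a convenient illustration whose moments of order 1 and 2 vanish. It uses exact rational arithmetic, and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# K(u) = 9/8 - (15/8) u^2 on [-1, 1], stored as {power: coefficient}.
KERNEL = {0: Fraction(9, 8), 2: Fraction(-15, 8)}


def monomial_integral(k: int) -> Fraction:
    """Exact integral of u^k over [-1, 1]."""
    return Fraction(0) if k % 2 else Fraction(2, k + 1)


def moment_1d(s: int) -> Fraction:
    """Exact integral of u^s * K(u) over [-1, 1]."""
    return sum(coef * monomial_integral(power + s) for power, coef in KERNEL.items())


def moment_product(s) -> Fraction:
    """Moment of the product kernel K(x_1)...K(x_d) for the multi-index s."""
    result = Fraction(1)
    for s_i in s:
        result *= moment_1d(s_i)
    return result


d, order = 3, 2
# All multi-indices s with s_1 + ... + s_d <= order.
moments = {
    s: moment_product(s)
    for s in product(range(order + 1), repeat=d)
    if sum(s) <= order
}
```

The product structure makes every mixed moment factorize into one-dimensional moments, so a multi-index with $1 \le s_1 + \cdots + s_d \le 2$ always contains a coordinate whose one-dimensional moment is zero.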

The following proposition holds. 

Proposition A.2. Fix $\beta > 0$. If $K$ is a $\beta$-valid kernel, then $K$ is also a $\beta'$-valid kernel for any $0 < \beta' \le \beta$.

Proof. Fix $\beta$ and $\beta'$ such that $0 < \beta' \le \beta$. Observe that $\lfloor\beta'\rfloor \le \lfloor\beta\rfloor$ yields that if $\lfloor\beta'\rfloor \ge 1$, then for any $\beta$-valid kernel $K$, we have $\int t^sK(t)\,dt = 0$ for any $s = (s_1, \ldots, s_d)$ such that $1 \le s_1 + \cdots + s_d \le \lfloor\beta'\rfloor$. It remains to check that

\[ \int \|t\|^{\beta'}|K(t)|\,dt < \infty. \tag{A.8} \]

Consider the decomposition

\begin{align*}
\int_{\mathbb{R}^d} \|t\|^{\beta'}|K(t)|\,dt
 &= \int_{\|t\| \le 1} \|t\|^{\beta'}|K(t)|\,dt + \int_{\|t\| > 1} \|t\|^{\beta'}|K(t)|\,dt \\
 &\le \int_{\mathbb{R}^d} |K(t)|\,dt + \int_{\|t\| > 1} \|t\|^{\beta'}|K(t)|\,dt.
\end{align*}

To prove (A.8), note that since $K$ is a $\beta$-valid kernel, we have $\int_{\mathbb{R}^d} |K(t)|\,dt < \infty$ and

\[ \int_{\|t\| > 1} \|t\|^{\beta'}|K(t)|\,dt \le \int_{\mathbb{R}^d} \|t\|^\beta|K(t)|\,dt < \infty. \quad □ \]

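A.5. Technical lemmas for minimax lower bounds

We collect here technical results that are used in Section 5. For a recent survey on the construction of minimax lower bounds, see Tsybakov (2009), Chapter 2. We first give a lemma related to subset extraction.

Fix an integer $k \ge 1$, and for any $\omega = (\omega_1, \ldots, \omega_k)$ and $\omega' = (\omega'_1, \ldots, \omega'_k)$ in $\{-1, 1\}^k$ or in $\{0, 1\}^k$, define the Hamming distance between $\omega$ and $\omega'$ by

\[ \rho(\omega, \omega') = \sum_{i=1}^k \mathbf{1}\{\omega_i \ne \omega'_i\}. \]

The following lemma can be found in Rigollet (2006), Lemma A.2. It is a straightforward corollary of Birgé and Massart (2001), Lemma 4, stated in a way which is more adapted to our purposes.

For any integers $N, \ell \ge 1$, define

\[ \Omega_\ell^N = \Big\{\omega \in \{0, 1\}^N,\ \sum_{j=1}^N \omega_j = \ell\Big\} \]

or, equivalently,

\[ \Omega_\ell^N = \Big\{\omega \in \{-1, 1\}^N,\ \sum_{j=1}^N (\omega_j + 1)/2 = \ell\Big\}. \]

Lemma A.1 (Birgé and Massart (2001)). Let $N$ and $\ell$ be two integers such that $N \ge 6\ell \ge 6$. There then exists a subset $\Omega$ of $\Omega_\ell^N$ such that

\[ \sum_{j=1}^N \mathbf{1}\{\omega_j \ne \omega'_j\} \ge \ell + 1 \quad \forall \omega \ne \omega' \in \Omega, \]

and $s = \mathrm{card}(\Omega)$ satisfies

\[ \log(s) \ge C\ell\log(N/\ell) \]

for some numerical constant $C > 0$.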
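To make the separation property of Lemma A.1 concrete, the sketch below builds a well-separated subset of $\{0,1\}^N$ with exactly $\ell$ ones per vector by a simple greedy scan. This greedy procedure is only an illustration on a tiny instance. It is not the argument behind the lemma and comes with no guarantee on the cardinality bound $\log(s) \ge C\ell\log(N/\ell)$. The function names are hypothetical.

```python
from itertools import combinations


def hamming(u, v):
    """Number of coordinates where u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def weight_l_vectors(N, l):
    """All vectors of {0,1}^N with exactly l ones, by lexicographic order of supports."""
    for support in combinations(range(N), l):
        yield tuple(1 if j in support else 0 for j in range(N))


def greedy_separated_subset(N, l):
    """Keep each vector whose Hamming distance to all kept vectors is at least l + 1."""
    kept = []
    for omega in weight_l_vectors(N, l):
        if all(hamming(omega, other) >= l + 1 for other in kept):
            kept.append(omega)
    return kept


subset = greedy_separated_subset(N=12, l=2)
```

With $N = 12$ and $\ell = 2$, two vectors of weight 2 are at Hamming distance 0, 2 or 4, so the requirement of distance at least 3 forces disjoint supports and the greedy scan keeps six vectors.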



The next lemma can be found in Tsybakov (2009), Theorem 2.7, and is stated here in a form adapted to the DLSE framework. It involves, in particular, the Kullback-Leibler divergence between two probability densities $p$ and $q$ on $\mathbb{R}^d$, defined by

\[ K(p, q) = \begin{cases} \displaystyle\int_{\mathbb{R}^d} p(x)\log\Big(\frac{p(x)}{q(x)}\Big)\,dx, & \text{if } P_p \ll P_q, \\ +\infty, & \text{otherwise.} \end{cases} \]

Lemma A.2. Let $d$ be a pseudo-distance between subsets of $\mathcal{X} \subset \mathbb{R}^d$. Let $\mathcal{P}$ be a set of densities and assume that there exists a finite subset $\mathcal{N} \subset \mathcal{P}$ with $2 \le \mathrm{card}(\mathcal{N}) = s < \infty$ such that

\[ d(\Gamma_p(\lambda), \Gamma_q(\lambda)) \ge 2\varepsilon \quad \forall p, q \in \mathcal{N},\ p \ne q, \tag{A.9} \]

and

\[ \max_{q \in \mathcal{N}} K(q, p_0) \le C\log(s) \tag{A.10} \]

for some $p_0 \in \mathcal{N}$.

There then exists an absolute positive constant $C'$ such that for any estimator $\hat G_n$ of $\Gamma_p(\lambda)$ constructed from the sample $X_1, \ldots, X_n$, we have

\[ \sup_{p \in \mathcal{P}} E[d(\Gamma_p(\lambda), \hat G_n)] \ge C'\varepsilon. \]